Boston fork: security hardening, upstream-portal protection, AWS hosting docs #1
Merged
brendanbabb merged 12 commits into main on Apr 30, 2026
Conversation
… pin
Move the deployment to us-west-2, add reserved Lambda concurrency as the primary brake on fan-out into the upstream CKAN portal, and pin Lambda packaging to cp311/manylinux wheels so the ZIP works regardless of build-host Python version.
- terraform/aws: add lambda_reserved_concurrency (default 10) wired to aws_lambda_function.reserved_concurrent_executions. Extract the S3 backend config out of main.tf; the real backend.tf is gitignored because the bucket name embeds the deployer's AWS account ID. Ship backend.tf.example as the template.
- prod/staging tfvars: aws_region=us-west-2, api_quota_limit=3000 (was 1000), lambda_reserved_concurrency=10. Prod custom domain is boston-data.codeforanchorage.org; staging has no custom domain.
- scripts/deploy.sh + .github/workflows/release.yml: force cp311 manylinux wheel resolution on every pip/uv install (without this, a Python 3.14 build host produces a ZIP that 502s at Lambda cold start). Detect python3/python cross-platform. Build the ZIP with stdlib zipfile instead of the `zip` binary so the packaging step works on CI images and Windows (a rough sketch of the zipfile approach follows this commit message).
- scripts/setup-backend.sh: fix malformed bucket name (boston-opencontext-opendataterraform-state-... → boston-opencontext-tfstate-...).
- config.yaml: replace the symlink-to-example with a concrete Boston CKAN config targeting data.boston.gov. ArcGIS kept disabled for reference.
- local_server.py: accept POSTs on both / and /mcp so the same local server works with Claude Desktop stdio bridges and MCP Inspector.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
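Since the commit leans on stdlib zipfile for packaging, here is a minimal sketch of that approach, assuming a staging-directory layout; the paths and function name are illustrative, not the actual scripts/deploy.sh logic.

```python
# Hedged sketch: build a Lambda ZIP with stdlib zipfile instead of the
# `zip` binary, so packaging works on CI images and Windows alike.
# "build" and "lambda.zip" are assumed names, not taken from the repo.
import os
import zipfile

def build_lambda_zip(staging_dir: str = "build", output: str = "lambda.zip") -> None:
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(staging_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the staging dir so the ZIP root
                # becomes the Lambda task root.
                zf.write(path, os.path.relpath(path, staging_dir))

if __name__ == "__main__":
    build_lambda_zip()
```

ZIP_DEFLATED matches the compression the `zip` binary would have produced, so the swap is behavior-neutral apart from portability.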
…size
Tighten the two attack surfaces that directly forward user-controlled
input into upstream CKAN: the execute_sql path and the aggregate_data
path. Add a request-body-size cap on the HTTP handler to bound the
work a single JSON-RPC call can cost. See docs/SECURITY.md for the
full threat model.
- plugins/ckan/sql_validator.py
* SQLValidator: shrink MAX_SQL_LENGTH 50000 → 8192; strip SQL
comments before keyword/function scans so /* ... */ and -- ...
obfuscation can't smuggle forbidden tokens past the checks;
expand FORBIDDEN_KEYWORDS with PREPARE/COPY/LISTEN/NOTIFY/VACUUM/
ANALYZE/CLUSTER/REINDEX/LOAD/DO; add FORBIDDEN_FUNCTIONS
(xp_cmdshell, pg_sleep, pg_read_file, pg_ls_dir, pg_stat_file,
lo_import, lo_export, current_setting, set_config, dblink);
walk the sqlparse AST to require every FROM/JOIN target to be a
UUID-quoted resource or a CTE alias (rejects schema-qualified
targets like pg_catalog.pg_class); match INTO OUTFILE/DUMPFILE.
* New enforce_row_limit: appends LIMIT 10000 to any validated SQL
that lacks a top-level LIMIT so a caller can't trigger an
unbounded scan on a multi-million-row CKAN DataStore table.
A sketch of the idea follows this commit message.
* New SafeSQLBuilder: typed, allowlist-only builder for the
aggregate_data path. Identifiers must match [A-Za-z_]\w*, metric
expressions must be count(*) or {count|sum|avg|min|max|stddev}
([DISTINCT] <ident>), filter values coerced per type with '
escaping, order_by parsed and quoted, limit clamped to 10000,
HAVING values must be numeric.
- plugins/ckan/plugin.py: route aggregate_data through
SafeSQLBuilder (was string concatenation); call
SQLValidator.enforce_row_limit after validate_query.
- server/http_handler.py: reject JSON-RPC bodies > 65 KB with
HTTP 413 before parsing. The MCP surface fits in a few KB; a
megabyte payload is either a bug or abuse.
- tests: cover body-size cap at and over the boundary, each new
forbidden keyword/function, comment obfuscation, schema-qualified
FROM rejection, enforce_row_limit behavior, and every
SafeSQLBuilder method.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
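For illustration, the enforce_row_limit idea reduces to roughly the sketch below. This is an assumption-laden reading of the commit message, not the code in plugins/ckan/sql_validator.py; in particular, how the real validator detects a top-level LIMIT may differ.

```python
# Sketch: append LIMIT 10000 when validated SQL has no top-level LIMIT.
# sqlparse groups subqueries into Parenthesis tokens, so their inner
# LIMITs don't count as top-level in this loop.
import sqlparse
from sqlparse.tokens import Keyword

MAX_ROWS = 10000  # cap named in the commit message

def enforce_row_limit(sql: str, max_rows: int = MAX_ROWS) -> str:
    statement = sqlparse.parse(sql)[0]
    for token in statement.tokens:
        if token.ttype is Keyword and token.normalized == "LIMIT":
            return sql  # caller already bounded the scan
    return f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"
```

Usage: enforce_row_limit('SELECT * FROM "<uuid>"') appends the limit, while a query already ending in LIMIT 50 passes through untouched.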
Document the Boston fork's AWS hosting and security posture, with portal protection as the top design constraint. Add stdio_bridge.py as a Python alternative to the Go stdio client for Claude Desktop.
- docs/AWS_DEPLOYMENT.md: how this fork is hosted (us-west-2, custom domain, reserved concurrency, cp311 packaging), what changed vs. upstream's single-region default, and how to operate the stack. Leads with the portal-protection design constraint.
- docs/SECURITY.md: the full rationale behind the hardening changes, organized around who is being protected — upstream portal first, deployment second, end users third. Covers the SQL validator and SafeSQLBuilder surface, rate limits and body-size cap, privacy posture (stateless, no PII, 14-day log retention, SQL truncated to 500 chars), and known gaps.
- README.md: link both new docs from the documentation table.
- stdio_bridge.py: minimal Python stdio-to-HTTP bridge. Reads JSON-RPC messages from stdin, POSTs them to the local/remote MCP server, writes responses to stdout. Useful where the Go client is impractical (Windows, no Go toolchain). A minimal sketch of the loop follows this commit message.
- CLAUDE.md: repo guidance for Claude Code sessions — commands, request flow, architecture notes, single-plugin rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
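The bridge's core loop is small enough to sketch. This assumes line-delimited JSON-RPC framing and a local /mcp endpoint (consistent with the local_server.py change above), but it is a guess at the real stdio_bridge.py internals, not a copy of them.

```python
# Hedged sketch of a stdio-to-HTTP MCP bridge: read JSON-RPC lines from
# stdin, POST each to the server, echo responses to stdout.
import sys
import urllib.request

SERVER_URL = "http://localhost:8000/mcp"  # assumed endpoint and port

def main() -> None:
    for line in sys.stdin:
        message = line.strip()
        if not message:
            continue
        req = urllib.request.Request(
            SERVER_URL,
            data=message.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = resp.read().decode("utf-8").strip()
        # JSON-RPC notifications may get an empty reply; write nothing then.
        if body:
            sys.stdout.write(body + "\n")
            sys.stdout.flush()

if __name__ == "__main__":
    main()
```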
Adds Method 4 to docs/TESTING.md covering how to wire the local HTTP server to Claude Desktop (claude_desktop_config.json) and Claude Code (.mcp.json) via stdio_bridge.py. The bridge was previously only mentioned in CLAUDE.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ce_id, search_and_query)
GPT-4o struggles to extract a resource UUID from one tool result and pass
it to the next call. This patch makes the CKAN plugin friendlier to weaker
multi-step tool callers without changing existing tool semantics:
- Tool descriptions now end with explicit "Next step" guidance naming the
follow-up tool (ckan__query_data, ckan__get_dataset, etc.).
- Per-parameter descriptions spell out provenance ("the `id` field inside
the `resources` array returned by ckan__search_datasets").
- search_datasets / get_dataset formatted output leads with a NEXT STEP
block exposing suggested_resource_id and a literal suggested_call line.
- New composite tool ckan__search_and_query that chains search + query
server-side, eliminating the cross-call ID-passing problem entirely.
Tests: tool count 6 -> 7, three new cases for search_and_query happy path,
no-match, and dataset-with-no-resources. Full CKAN suite passes (32/32).
Boston attaches each dataset as 5-7 resources (GeoJSON, KML, SHP, PDF, ArcGIS REST, CSV) but only the CSV is loaded into the queryable Postgres datastore. The previous code blindly handed the model `resources[0].id`, which is the GeoJSON one for most parks/permits/etc. datasets, and the caller then got a 404 from datastore_search.
Fixes:
- New _is_queryable / _first_queryable_resource helpers (sketched after this commit message).
- search_datasets formatter: suggested_resource_id and per-dataset resource_id lines now point at the datastore_active resource only. When no resource is queryable, say so explicitly instead of suggesting a doomed UUID.
- get_dataset formatter: NEXT STEP block points at the queryable resource; the resource list labels each as [QUERYABLE] or [DOWNLOAD-ONLY] and surfaces the download URL for non-queryable ones.
- search_and_query: walks search results and resources to find one with datastore_active=true (configurable via dataset_index/resource_index). Falls through to the next dataset rather than 404-ing on a download-only first match.
- query_data: 404s now append a hint about the datastore_active gotcha pointing at get_dataset / search_and_query so the model doesn't keep retrying the same UUID.
Tool descriptions updated to call out the queryable-vs-download distinction.
Tests: 35 passing (was 32). Three new regression tests:
- parks-style multi-resource shape skips GeoJSON and queries CSV
- walks to the next dataset when the first has no queryable resource
- query_data 404 includes the datastore_active hint
Live smoke test against data.boston.gov confirms ckan__search_and_query "parks" now returns real Park_Features rows (Iacono/Readville Playground, East Boston Memorial Park, etc.) instead of 404.
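A minimal sketch of the selection helpers, under the assumption that each CKAN resource dict carries the datastore_active flag; the names mirror the commit, the exact signatures are guesses.

```python
# Sketch of the queryable-resource helpers from this commit.
def _is_queryable(resource: dict) -> bool:
    # Only resources loaded into the Postgres datastore answer
    # datastore_search; everything else (GeoJSON, KML, SHP, PDF) 404s.
    return bool(resource.get("datastore_active"))

def _first_queryable_resource(dataset: dict) -> dict | None:
    for resource in dataset.get("resources", []):
        if _is_queryable(resource):
            return resource
    return None  # caller says "no queryable resource" instead of a doomed UUID
```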
Real failure observed: model asked for "311 service requests closed on
4/29", called ckan__search_and_query without filters, and got 10
unfiltered rows (almost none closed on 4/29; 85 actually were). Two
design problems:
1) `query` matches dataset metadata, not row content — but the model
read it as a row filter.
2) `filters` is equality-only against datastore_search, so it can't
express "close_date in [4/29, 4/30)" even if used.
Fixes (A + B + C from the proposal):
A) Tool descriptions explicitly distinguish dataset-metadata search
from row filtering and steer to the right knob (filters / where /
execute_sql) with concrete examples.
B) Every successful query_data / search_and_query response now ends
with a "Filterable columns" footer listing field names + types,
pulled from the datastore response. The next pivot to where /
execute_sql becomes a one-shot.
C) New structured `where` argument on query_data and search_and_query.
Routes through datastore_search_sql via SafeSQLBuilder; supports
eq, ne, gt, gte, lt, lte, in, not_in, like, ilike, is_null. Strings
are length-capped (256) and quote-escaped; IN-lists capped (100);
identifiers validated as before. A sketch of the operator mapping
follows this commit message.
Live verification: search_and_query("311", where={close_date: {gte:
"2026-04-29", lt: "2026-04-30"}, case_status: "Closed"}) now returns
exactly 85 rows — matching ground truth from a manual SQL count.
Tests: +26 new (170 -> 196). Coverage:
- where-clause builder for every operator + edge case (injection
escaping, bad ops/types, oversized lists/strings, unknown operators)
- query_data routes through datastore_search_sql when `where` set
- query_data validation errors short-circuit before any API call
- schema footer surfaces in both SQL and non-SQL paths
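A hedged sketch of how a `where` dict can compile into a SQL fragment under the stated caps. build_where is a hypothetical standalone helper for illustration; the shipped code routes through SafeSQLBuilder instead.

```python
# Sketch: structured `where` -> SQL fragment with the caps from the commit
# (256-char strings, 100-item IN lists, identifier allowlist).
import re

_OPS = {"eq": "=", "ne": "<>", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}
_IDENT = re.compile(r"[A-Za-z_]\w*$")

def _quote(value) -> str:
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    text = str(value)[:256]                      # string length cap
    return "'" + text.replace("'", "''") + "'"   # quote-escape

def build_where(where: dict) -> str:
    clauses = []
    for column, cond in where.items():
        if not _IDENT.match(column):
            raise ValueError(f"bad identifier: {column!r}")
        if not isinstance(cond, dict):
            cond = {"eq": cond}                  # bare value means equality
        for op, value in cond.items():
            if op in _OPS:
                clauses.append(f'"{column}" {_OPS[op]} {_quote(value)}')
            elif op in ("in", "not_in"):
                items = ", ".join(_quote(v) for v in list(value)[:100])
                negate = "NOT " if op == "not_in" else ""
                clauses.append(f'"{column}" {negate}IN ({items})')
            elif op in ("like", "ilike"):
                clauses.append(f'"{column}" {op.upper()} {_quote(value)}')
            elif op == "is_null":
                clauses.append(f'"{column}" IS {"" if value else "NOT "}NULL')
            else:
                raise ValueError(f"unknown operator: {op!r}")
    return " AND ".join(clauses)
```

With the commit's live example, build_where({"close_date": {"gte": "2026-04-29", "lt": "2026-04-30"}, "case_status": "Closed"}) yields "close_date" >= '2026-04-29' AND "close_date" < '2026-04-30' AND "case_status" = 'Closed'.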
Boston's 311 dataset attaches 22 queryable CSV resources (a rolling
'NEW SYSTEM' view plus per-year archives 2011–2026). The model only ever
saw whichever the auto-pick chose first — typically the rolling view —
which silently dropped older data on the floor for historical questions.
Two improvements:
(1) `_format_search_and_query` now emits an "Other queryable resources
in this dataset" block listing every datastore_active resource other
than the chosen one (name, format, resource_id), so the model can see
that '311 - 2020', '311 - 2019', etc. exist.
(2) New `resource_name` argument on search_and_query: case-insensitive
substring match against resource `name`. Takes precedence over
`resource_index`. With dataset_index pinned, a no-match returns a clean
error listing the available names. A sketch of the selection precedence
follows this commit message.
Selection precedence (high → low):
resource_name > resource_index > first datastore_active resource
Tool description updated with concrete example
(resource_name='2020' picks the 2020 archive). The 'Other queryable
resources' block tells the model these names exist; the description
tells it how to use them.
Live verification: search_and_query("311", resource_name="2020") now
queries 311 SERVICE REQUESTS - 2020 (resource 6ff6a6fd-...) and returns
2020 records. Default no-resource_name query still picks the rolling
NEW SYSTEM and surfaces all 21 archive siblings in the response.
Tests: +3 (38 -> 41) covering resource_name match, no-match error, and
the siblings block (queryable-only, excludes download-only resources).
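The precedence rule can be sketched as below; pick_resource is a hypothetical name, and treating resource_index as an index into the queryable subset (rather than all resources) is an assumption of this sketch.

```python
# Sketch of resource selection: resource_name > resource_index > first
# datastore_active resource.
def pick_resource(resources: list[dict], resource_name: str | None = None,
                  resource_index: int | None = None) -> dict | None:
    queryable = [r for r in resources if r.get("datastore_active")]
    if resource_name is not None:
        # Case-insensitive substring match against the resource name.
        needle = resource_name.lower()
        for r in queryable:
            if needle in (r.get("name") or "").lower():
                return r
        return None  # caller emits the clean error listing available names
    if resource_index is not None and 0 <= resource_index < len(queryable):
        return queryable[resource_index]
    return queryable[0] if queryable else None
```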
Real-world failure: bot answered "How many 311 requests closed on
4/29/2016?" with 100 — but 100 was the LIMIT; the true count is 531.
The model was reading "Found 100 record(s) (showing up to 100)" as
the answer to a counting question.
Fixes:
1. Capture CKAN's `total` from datastore_search responses (it's
already returned by CKAN; we were discarding it). Plumbed through
_query_with_schema → ToolResult → formatter.
2. SQL (`where`) path: when SELECT * hits the limit exactly we don't
know the true total, so issue a cheap SELECT COUNT(*) follow-up
with the same WHERE. This means we always tell the model a real
number when it asks a counting question via search_and_query or
query_data, regardless of which path was taken. (A sketch follows
this commit message.)
3. Formatter rewrite:
- Header line is now `total_matching_rows: N (returned K, limit=L)`
when total is known and exceeds returned. Removed the misleading
"Found 100 record(s)" wording that conflated rows-returned with
true count.
- When truncation is detected, prepend a `=== TRUNCATED ===` block
stating "the answer is N, NOT K" and pointing at
ckan__aggregate_data for cheap counts and ckan__execute_sql for
custom LIMIT/ORDER BY.
- When total can't be determined (count follow-up failed) and rows
== limit, prepend `=== MAY BE TRUNCATED ===`.
Live verification (boston prod CKAN):
- search_and_query "311 closed on 2016-04-29" with default limit=100
→ returns the 100 sample rows AND header "total_matching_rows: 531"
AND the TRUNCATED warning telling the model the answer is 531.
- search_and_query "311 closed on 2026-04-29" with limit=100
→ returns 85 rows, no TRUNCATED warning (85 < 100), header reads
"total_matching_rows: 85".
Tests: +5 (41 -> 46) covering total surfaced from datastore_search,
no-warning path when under limit, COUNT(*) follow-up fires only when
truncated on the SQL path, and graceful fallback when count fails.
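The follow-up logic reduces to roughly this sketch, where run_sql stands in for the datastore_search_sql call and the double-quoted resource_id follows the UUID-quoting convention used elsewhere in this PR; the name and signature are assumptions.

```python
# Sketch: derive a trustworthy total on the SQL path. Under the limit the
# result is complete; at the limit, issue a cheap COUNT(*) with the same
# WHERE; on failure return None so the formatter says MAY BE TRUNCATED.
def total_for(run_sql, resource_id: str, where_sql: str,
              records: list[dict], limit: int) -> int | None:
    if len(records) < limit:
        return len(records)
    count_sql = f'SELECT COUNT(*) AS n FROM "{resource_id}"'
    if where_sql:
        count_sql += f" WHERE {where_sql}"
    try:
        return int(run_sql(count_sql)[0]["n"])
    except Exception:
        return None
```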
User flagged: limit-as-count confusion isn't unique to query_data.
Same trap exists in execute_sql ("returned 100 because the SQL said
LIMIT 100") and in search_datasets ("found 20 datasets" while CKAN
knows there are actually 47 matches, since package_search returns
`count`). Switched to the user-suggested 'X of Y' phrasing throughout.
Changes:
1. Shared _format_count_header helper renders three cases (sketched
after this commit message):
- "100 of 531 rows returned (limit=100; raise limit or use
ckan__aggregate_data for full count)."
- "85 rows returned (full result, limit=100)."
- "100 rows returned (limit=100, true total unknown — see
TRUNCATED warning above if any)."
Used in query_data, search_and_query, and search_datasets
(with unit="matching dataset(s) shown").
2. search_datasets now reads CKAN's `count` from package_search and
surfaces it: "5 of 21 matching dataset(s) shown" instead of
"Found 5 dataset(s)".
3. execute_sql now extracts the effective LIMIT from the validated
SQL via SQLValidator.extract_top_level_limit() and emits a
"MAY BE TRUNCATED" block when len(records) >= effective_limit
(datastore_search_sql doesn't return a `total`, so this heuristic
is the best we can do). aggregate_data inherits the same guard
since it formats results via _format_sql_results.
4. Existing query_data/search_and_query header lines switched to the
X-of-Y phrasing for consistency. The TRUNCATED warning text now
matches the rest of the corpus.
Live verification (boston prod CKAN):
- search_datasets("parks", limit=5) → "5 of 21 matching dataset(s)
shown (limit=5; raise limit to see more)".
- execute_sql with LIMIT 100 returning 100 rows → MAY BE TRUNCATED
block at top, "100 rows returned (limit=100, true total
unknown...)".
Tests: +8 (204 -> 212). New: search_datasets count rendering, two
execute_sql truncation cases, four extract_top_level_limit unit tests
(simple, semicolon, missing, subquery-ignored), enforce-then-extract
round trip.
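The three header cases map onto a small helper; this sketch reuses the exact strings quoted above, but the signature and defaults are assumptions, and the real helper also varies the parenthetical per caller (e.g. "raise limit to see more" for search_datasets).

```python
# Sketch of the shared X-of-Y count header.
def _format_count_header(returned: int, total: int | None, limit: int,
                         unit: str = "rows returned") -> str:
    if total is not None and total > returned:
        return (f"{returned} of {total} {unit} (limit={limit}; raise limit "
                f"or use ckan__aggregate_data for full count).")
    if total is not None:
        return f"{returned} {unit} (full result, limit={limit})."
    return (f"{returned} {unit} (limit={limit}, true total unknown — see "
            f"TRUNCATED warning above if any).")
```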
Summary
Two phases of work, all on this branch:
Phase 1 — Infra/security/docs (original PR scope)
- Hardened SQL validation plus an allowlist-only aggregate_data SQL builder; 64 KB request body cap with 413 before JSON parsing.
- New docs and tooling: docs/AWS_DEPLOYMENT.md, docs/SECURITY.md, stdio_bridge.py, CLAUDE.md.

Phase 2 — GPT-4o tool-chaining + CKAN ergonomics (added 2026-04-30)
Real-world failures observed via the deployed Lambda informed each fix:
- Tool-chaining hints (2f3c09f) — fattened tool descriptions with explicit "Next step" hints; suggested_resource_id surfaced at the top of search/dataset output; new composite ckan__search_and_query tool that internally chains search_datasets + query_data.
- Queryable-resource selection (ddcee2e, 668e24f) — Boston datasets attach 5–7 download-only resources (GeoJSON, KML, SHP, PDF) plus a single CSV that's actually loaded into the datastore. The plugin now picks the queryable one and labels each resource [QUERYABLE] vs [DOWNLOAD-ONLY]. Fixes the parks query that 404'd.
- Structured `where` clauses (5825317) — new `where` argument on query_data and search_and_query supporting eq, ne, gt, gte, lt, lte, in, not_in, like, ilike, is_null via SafeSQLBuilder. Routes through datastore_search_sql when set. Fixes "311 closed on 2026-04-29" (returns 85, the true count). Successful responses now also end with a Filterable columns block listing field names + types so the next pivot is a one-shot.
- resource_name (c1608e2) — Boston's 311 dataset has 22 queryable CSVs (rolling NEW SYSTEM + per-year archives 2011–2026); the model couldn't see them. A sibling block now lists them; the new resource_name argument picks one by case-insensitive substring match (resource_name="2020" → 311 SERVICE REQUESTS - 2020).
- Truncation honesty (d9b6fca, 47fbdca) — the model was reading "Found 100 record(s)" as a count when 100 was the LIMIT. CKAN's total (from datastore_search) and count (from package_search) are now surfaced. The SQL path does a COUNT(*) follow-up when len(records) >= limit. execute_sql extracts the effective LIMIT and warns when truncated. Phrasing switched to "100 of 531 rows returned" everywhere. Fixes "311 closed on 2016-04-29" (now reports 531 instead of the limit-100 misread).
- Lint (a8e9fea) — ruff F541.

Why
Same ground rule as Phase 1: don't overwhelm data.boston.gov. Phase 2 layers on a different goal — make the tools usable enough that GPT-4o (which is weaker at multi-step chaining than Claude) can answer real questions without fumbling. Each fix landed in response to a specific observed failure on the deployed prod Lambda.

Test plan
- Full test suite passes at a8e9fea.
- Live against the deployed Lambda (boston-opencontext-mcp-prod in us-west-2):
  - search_and_query("parks") → Park_Features CSV, real park rows
  - search_and_query("311", where={close_date: {gte: "2026-04-29", lt: "2026-04-30"}, case_status: "Closed"}) → 85 (matches manual SQL count)
  - search_and_query("311", resource_name="2016", where={closed_dt: {gte: "2016-04-29", lt: "2016-04-30"}}) → 100 of 531 rows returned + TRUNCATED block telling the model "the answer is 531, NOT 100"
  - search_datasets("parks", limit=5) → 5 of 21 matching dataset(s) shown

🤖 Generated with Claude Code